Unstructured Data Integration through Automata-Driven Information Extraction
Authors
Abstract
Extracting information from plain text and restructuring it into a relational database raises the challenge of locating the relevant information and updating database records accordingly. In this paper, we propose a wrapper that efficiently extracts information from unstructured documents containing plain text expressed in a natural-like language. Our extraction approach uses the automata formalism to describe the wrapping process that runs from text documents to databases. As usual, relevant information in a text document is delimited by regular expressions, which define the extracting automaton. Each automaton is enriched with an output function that automatically generates SQL queries, synchronized with the extraction process, to insert the extracted data into database records. We validate our approach with an automaton-based prototype that extracts legal information about decrees from the Lebanese official journal and automatically inserts it into a relational database.
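The mechanism described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' prototype: the decree layout, the field names, the `decree` table, and the functions `run_automaton` and `load` are all hypothetical. Each state carries a regular expression that delimits one field, and an output function emits a parameterized SQL INSERT whenever the automaton completes a record.

```python
import re
import sqlite3

# Hypothetical decree layout and schema, invented for this sketch.
# state -> (delimiting regex, field to fill, next state)
TRANSITIONS = {
    "start": (re.compile(r"^Decree No\.\s*(\d+)$"), "number", "date"),
    "date": (re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$"), "issued_on", "subject"),
    "subject": (re.compile(r"^Subject:\s*(.+)$"), "subject", "start"),
}

INSERT_SQL = "INSERT INTO decree (number, issued_on, subject) VALUES (?, ?, ?)"


def run_automaton(lines):
    """Yield (sql, params) pairs, one per completed record."""
    state = "start"
    record = {}
    for line in lines:
        pattern, field, next_state = TRANSITIONS[state]
        match = pattern.match(line.strip())
        if match is None:
            continue  # irrelevant text: remain in the current state
        record[field] = match.group(1)
        state = next_state
        if state == "start":
            # Returning to the start state means all three fields were read,
            # so the output function emits the INSERT for this record.
            params = (int(record["number"]), record["issued_on"], record["subject"])
            yield INSERT_SQL, params
            record = {}


def load(text, connection):
    """Run the automaton over text and execute the generated queries."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS decree "
        "(number INTEGER, issued_on TEXT, subject TEXT)"
    )
    for sql, params in run_automaton(text.splitlines()):
        connection.execute(sql, params)
    connection.commit()
```

Calling `load(text, sqlite3.connect(":memory:"))` on a document that follows the assumed layout fills the `decree` table as the text is scanned. Passing values as query parameters, rather than pasting them into the SQL string, is a design choice of this sketch that keeps extracted text from being interpreted as SQL.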
Similar resources
Unstructured information integration through data-driven similarity discovery
Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data, without relying on the availability o...
Ontology-driven Information Extraction
Homogeneous unstructured data (HUD) are collections of unstructured documents that share common properties, such as similar layout, common file format, or common domain of values. Building on such properties, it would be desirable to automatically process HUD to access the main information through a semantic layer – typically an ontology – called semantic view. Hence, we propose an ontology-bas...
Methods for Ontology-Driven Integration
This paper describes the motivations, approach, and architecture for using ontologies in knowledge extraction and in applications that assist situated agents in complex information integration tasks. Our approach applies ontologies along with semantic analysis methods to extract task relevant knowledge from distributed, unstructured text sources. This knowledge is then applied to assist in info...
Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis
The Market Blended Insight project has the objective of improving the UK business to business marketing performance using the semantic web technologies. In this project, we are implementing an ontology driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It deals with both the semi-structured data and the un...
A Mutually Beneficial Integration of Data Mining and Information Extraction
Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DISCOTEX, that combines IE and data mining methodologies to perform text mining as well as...